ForestSearch Package Architecture

Function Relationships and Module Organization

Author

ForestSearch Development Team

Published

January 26, 2026

1 Overview

ForestSearch is a comprehensive R package for exploratory subgroup identification in clinical trials with survival endpoints. The package implements advanced methods from León et al. (2024) published in Statistics in Medicine.

Key Methods:

  • Generalized Random Forests (GRF) for variable importance
  • LASSO regularization for dimension reduction
  • Exhaustive combinatorial search for subgroup discovery
  • Split-sample validation for consistency assessment
  • Bootstrap bias correction using infinitesimal jackknife methods

2 High-Level Workflow

The ForestSearch algorithm proceeds through five main phases:

flowchart LR
    A["📊 Input Data"] --> B["🔬 Variable\nSelection"]
    B --> C["📋 Data\nPreparation"]
    C --> D["🔍 Subgroup\nSearch"]
    D --> E["✅ Consistency\nEvaluation"]
    E --> F["📈 Output"]
    
    style A fill:#E3F2FD,stroke:#1976D2,stroke-width:2px
    style B fill:#E8F4FD,stroke:#2E86AB,stroke-width:2px
    style C fill:#FFF3E0,stroke:#F57C00,stroke-width:2px
    style D fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
    style E fill:#E8F5E9,stroke:#388E3C,stroke-width:2px
    style F fill:#F3E5F5,stroke:#7B1FA2,stroke-width:2px

After the core analysis, optional validation steps include:

flowchart LR
    A["✅ ForestSearch\nResult"] --> B["🔄 Bootstrap\nBias Correction"]
    A --> C["📈 K-Fold\nCross-Validation"]
    B --> D["📊 Final\nReport"]
    C --> D
    
    style A fill:#E8F5E9,stroke:#388E3C,stroke-width:2px
    style B fill:#E8F4FD,stroke:#2E86AB,stroke-width:2px
    style C fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
    style D fill:#FFFDE7,stroke:#F9A825,stroke-width:2px


3 Phase 1: Main Entry Point

3.1 forestsearch() - The Orchestrator

Location: R/forest_search_revised.R

flowchart TB
    FS[["forestsearch()"]]
    
    FS --> P1["1. Validate Inputs"]
    FS --> P2["2. GRF Variable Selection"]
    FS --> P3["3. LASSO Selection"]
    FS --> P4["4. Data Preparation"]
    FS --> P5["5. Subgroup Search"]
    FS --> P6["6. Consistency Evaluation"]
    FS --> P7["7. Return Results"]
    
    P1 --> P2
    P2 --> P3
    P3 --> P4
    P4 --> P5
    P5 --> P6
    P6 --> P7
    
    style FS fill:#2E86AB,stroke:#1a5276,stroke-width:3px,color:#fff

Key Parameters:

Parameter Description Default
df.analysis Analysis dataset Required
confounders.name Candidate variables NULL
use_lasso Enable LASSO TRUE
use_grf Enable GRF TRUE
hr.threshold Min HR for candidates 1.25
pconsistency.threshold Consistency requirement 0.90
fs.splits Validation splits 1000
maxk Max factors per subgroup 2

4 Phase 2: Variable Selection

4.1 GRF-Based Selection

Function: grf.subg.harm.survival() Location: R/grf_main.R

flowchart TB
    subgraph GRF["GRF Variable Selection"]
        A["Input: Data + Confounders"] --> B["fit_causal_forest()"]
        B --> C["Compute Variable\nImportance"]
        C --> D["Fit Policy Trees\n(depth 1, 2, 3)"]
        D --> E["select_best_subgroup()"]
        E --> F["extract_all_tree_cuts()"]
        F --> G["Output: GRF Cuts"]
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style G fill:#E8F5E9,stroke:#388E3C

GRF Helper Functions:

Function Purpose
create_grf_config() Initialize parameters
fit_causal_forest() Fit causal survival forest
compute_node_metrics() Metrics per tree node
select_best_subgroup() Choose optimal subgroup
extract_all_tree_cuts() Get cut expressions

4.2 LASSO-Based Selection

Function: lasso_selection() Location: R/get_FSdata_helpers.R

flowchart TB
    subgraph LASSO["LASSO Variable Selection"]
        A["Input: Confounders"] --> B["cv.glmnet()\nCox-LASSO"]
        B --> C["Select lambda.min"]
        C --> D["Extract Coefficients"]
        D --> E{"Coef ≠ 0?"}
        E -->|Yes| F["Selected Variables"]
        E -->|No| G["Omitted Variables"]
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style F fill:#E8F5E9,stroke:#388E3C
    style G fill:#FFEBEE,stroke:#C62828


5 Phase 3: Data Preparation

5.1 get_FSdata() - Creating Binary Indicators

Location: R/get_FSdata_refactored.R

flowchart TB
    subgraph DataPrep["Data Preparation Pipeline"]
        A["Raw Data"] --> B["Classify Variables"]
        B --> C{"Continuous?"}
        C -->|Yes| D["Create Cut\nIndicators"]
        C -->|No| E["Create Dummy\nVariables"]
        D --> F["Apply LASSO\nFilter"]
        E --> F
        F --> G["Apply GRF\nCuts"]
        G --> H["Handle Forced\nCuts"]
        H --> I["Binary Indicator\nMatrix Z"]
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style I fill:#E8F5E9,stroke:#388E3C

Data Preparation Helpers:

Function Purpose
is.continuous() Check variable type
dummy() Create dummy variables
get_conf_force() Generate forced cuts
evaluate_cuts_once() Evaluate & cache cuts
FS_labels() Convert q-codes to labels
filter_by_lassokeep() Apply LASSO filter

7 Phase 5: Consistency Evaluation

7.1 subgroup.consistency() - Split-Sample Validation

Location: R/subgroup_consistency_main.R

Two algorithms are available:

7.2 Fixed Algorithm (Default)

flowchart TB
    subgraph Fixed["Fixed Consistency Algorithm"]
        A["Candidate\nSubgroup"] --> B["Repeat n.splits times:"]
        B --> C["Random 50/50\nSplit"]
        C --> D["Compute HR\nin Split 1"]
        C --> E["Compute HR\nin Split 2"]
        D --> F{"Both HR >\nhr.consistency?"}
        E --> F
        F -->|Yes| G["Count as\nConsistent"]
        F -->|No| H["Count as\nInconsistent"]
        G --> B
        H --> B
        B -->|Done| I["Pcons =\nConsistent / Total"]
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style I fill:#E8F5E9,stroke:#388E3C

7.3 Two-Stage Algorithm (Optional)

flowchart TB
    subgraph TwoStage["Two-Stage Consistency Algorithm"]
        A["Candidate"] --> B["Stage 1:\nQuick Screen\n(30 splits)"]
        B --> C{"Pcons >\nscreen.threshold?"}
        C -->|No| D["REJECT"]
        C -->|Yes| E["Stage 2:\nBatched Evaluation"]
        E --> F["Run batch of\n20 splits"]
        F --> G["Compute Wilson CI"]
        G --> H{"CI lower >\nthreshold?"}
        H -->|Yes| I["ACCEPT"]
        H -->|No| J{"CI upper <\nthreshold?"}
        J -->|Yes| D
        J -->|No| E
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style D fill:#FFEBEE,stroke:#C62828
    style I fill:#E8F5E9,stroke:#388E3C

Consistency Helper Functions:

Function Purpose
evaluate_subgroup_consistency() Fixed algorithm
evaluate_consistency_twostage() Two-stage algorithm
get_split_hr_fast() Fast HR for split
wilson_ci() Wilson confidence interval
early_stop_decision() Early stopping logic
sort_subgroups() Sort by sg_focus
extract_subgroup() Get subgroup definition

8 Phase 6: Bootstrap Bias Correction

8.1 forestsearch_bootstrap_dofuture()

Location: R/bootstrap_dofuture_main.R

flowchart TB
    subgraph Bootstrap["Bootstrap Bias Correction"]
        A["Original\nResult"] --> B["Generate B\nBootstrap Samples"]
        B --> C["For each bootstrap:"]
        C --> D["Run forestsearch()\non bootstrap data"]
        D --> E["Compute HRs:\nH*, H*_obs, H*_*"]
        E --> C
        C -->|Done| F["Aggregate Results"]
        F --> G["Compute Bias\nCorrection"]
        G --> H["Adjusted HR\nEstimate"]
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style H fill:#E8F5E9,stroke:#388E3C

Bias Correction Formulas:

Method Formula
Simple Optimism \(H_{adj1} = H_{obs} - (H^*_* - H^*_{obs})\)
Double Bootstrap \(H_{adj2} = 2H_{obs} - (H_* + H^*_* - H^*_{obs})\)

Bootstrap Helper Functions:

Function Location
bootstrap_results() Coordinate iterations
run_single_bootstrap() One bootstrap iteration
summarize_bootstrap_results() Aggregate statistics
summarize_bootstrap_subgroups() Subgroup agreement

9 Phase 7: Cross-Validation

9.1 forestsearch_Kfold()

Location: R/forestsearch_cross-validation.R

flowchart TB
    subgraph CV["K-Fold Cross-Validation"]
        A["Full Dataset"] --> B["Split into\nK Folds"]
        B --> C["For fold k = 1 to K:"]
        C --> D["Train: K-1 folds"]
        C --> E["Test: fold k"]
        D --> F["forestsearch()\non training"]
        F --> G["Evaluate on\ntest fold"]
        G --> C
        C -->|Done| H["forestsearch_KfoldOut()\nAggregate Results"]
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style H fill:#E8F5E9,stroke:#388E3C

CV Helper Functions:

Function Purpose
forestsearch_KfoldOut() Summarize K-fold results
forestsearch_tenfold() Repeated K-fold wrapper
find_covariate_any_match() Match covariates across folds

10 Phase 8: Output & Visualization

10.1 S3 Methods and Summary Functions

flowchart TB
    subgraph Output["Output Functions"]
        A["forestsearch\nresult"] --> B["print.forestsearch()"]
        A --> C["summary.forestsearch()"]
        A --> D["sg_tables()"]
        A --> E["plot_subgroup_results_forestplot()"]
        
        B --> F["Console\nSummary"]
        C --> G["Detailed\nStatistics"]
        D --> H["gt Tables"]
        E --> I["Forest Plot"]
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style F fill:#FFFDE7,stroke:#F9A825
    style G fill:#FFFDE7,stroke:#F9A825
    style H fill:#FFFDE7,stroke:#F9A825
    style I fill:#FFFDE7,stroke:#F9A825

Output Functions:

Function Purpose
print.forestsearch() Basic summary
summary.forestsearch() Detailed stats
sg_tables() gt-formatted tables
plot_subgroup_results_forestplot() Forest plots
cox_cs_fit() Spline Cox models
filter_call_args() Filter function args

11 Simulation & DGM Module

11.1 Data Generation for Simulations

flowchart TB
    subgraph Simulation["Simulation Workflow"]
        A["Define DGM\nParameters"] --> B["generate_aft_dgm_flex()"]
        B --> C["simulate_from_dgm()"]
        C --> D["Simulated\nTrial Data"]
        D --> E["run_forestsearch_analysis()"]
        E --> F["Evaluate\nPerformance"]
    end
    
    style A fill:#E3F2FD,stroke:#1976D2
    style D fill:#FFF3E0,stroke:#F57C00
    style F fill:#E8F5E9,stroke:#388E3C

Simulation Functions:

Function Location Purpose
generate_aft_dgm_flex() generate_aft_dgm_flex.R Create AFT DGM
simulate_from_dgm() simulate_from_dgm.R Generate data
run_mrct_simulation() mrct_simulation.R MRCT simulation
run_single_simulation() oc_analyses_gbsg.R Single iteration

12 Parallel Processing

ForestSearch supports multiple parallel backends:

flowchart LR
    subgraph Backends["Parallel Backends"]
        A["sequential"] --> E["Single Thread"]
        B["multisession"] --> F["Background R Sessions"]
        C["multicore"] --> G["Fork Processes\n(Unix/Mac)"]
        D["callr"] --> H["Separate R Processes"]
    end
    
    style A fill:#ECEFF1,stroke:#546E7A
    style B fill:#E8F4FD,stroke:#2E86AB
    style C fill:#E8F5E9,stroke:#388E3C
    style D fill:#FFF3E0,stroke:#F57C00


13 Package Dependencies

flowchart TB
    subgraph Stats["Core Statistical"]
        survival
        grf
        glmnet
        policytree
    end
    
    subgraph Data["Data Handling"]
        data.table
        stringr
    end
    
    subgraph Par["Parallel"]
        future
        doFuture
        foreach
        callr
    end
    
    subgraph Viz["Visualization"]
        ggplot2
        gt
        forestploter
    end
    
    FS["forestsearch"] --> Stats
    FS --> Data
    FS --> Par
    FS --> Viz
    
    style FS fill:#2E86AB,stroke:#1a5276,color:#fff


14 File Organization

R/
├── forest_search_revised.R          # Main forestsearch()
├── get_FSdata_refactored.R          # Data preparation
├── get_FSdata_helpers.R             # Data prep helpers
├── subgroup_search.R                # Exhaustive search
├── subgroup_consistency_main.R      # Consistency evaluation
├── subgroup_consistency_helpers.R   # Consistency helpers
├── grf_main.R                       # GRF subgroup ID
├── grf_helpers.R                    # GRF helpers
├── bootstrap_dofuture_main.R        # Bootstrap main
├── bootstrap_analysis_dofuture.R    # Bootstrap iteration
├── bootstrap_summaries_helpers.R    # Bootstrap summary
├── summarize_bootstrap_subgroups.R  # Subgroup summary
├── forestsearch_cross-validation.R  # K-fold CV
├── summary_utility_functions.R      # Output utilities
├── summary_forestsearch.R           # S3 summary
├── plot_subgroup_results_forestplot.R # Forest plots
├── cox_spline_fit.R                 # Spline Cox
├── generate_aft_dgm_flex.R          # DGM generation
├── simulate_from_dgm.R              # Simulation
├── mrct_simulation.R                # MRCT sim
└── oc_analyses_gbsg.R               # OC analyses

15 Typical Workflow Example

# 1. Run main analysis
fs_result <- forestsearch(
  df.analysis = trial_data,
  confounders.name = c("age", "biomarker", "stage"),
  hr.threshold = 1.25,
  pconsistency.threshold = 0.90,
  use_grf = TRUE,
  use_lasso = TRUE,
  details = TRUE
)

# 2. Bootstrap bias correction
fs_boot <- forestsearch_bootstrap_dofuture(
  fs.est = fs_result,
  nb_boots = 1000,
  parallel_args = list(plan = "callr", workers = 6)
)

# 3. Cross-validation
fs_cv <- forestsearch_Kfold(
  fs.est = fs_result,
  Kfolds = 10
)

# 4. Summary and visualization
print(fs_result)
sg_tables(fs_result)

16 References

  • León LF, et al. (2024). Exploratory subgroup identification in the heterogeneous Cox model. Statistics in Medicine.
  • Athey S, Imbens G (2016). Recursive partitioning for heterogeneous causal effects. PNAS.
  • Wager S, Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. JASA.

Document generated for ForestSearch R package - CRAN submission preparation